Libraries

Typically, all libraries are loaded first as best practice. Here, however, we’re loading them where relevant so that you get an idea of where the functions from different libraries apply. The libraries we will use are tidyverse, corrplot, polycor, and FactoMineR, plus single functions from dlookr and prettyR.

We will also use one more package, ggbiplot, that is not available to download from CRAN (the R package repository). We will install it from GitHub instead:

library(devtools)
install_github("vqv/ggbiplot")

This might take a minute or two.

Loading the data

Let’s load tidyverse first. It’s an essential toolset in R for data manipulation.

For more, check out: https://github.com/MaherSaid/ntc-workshop-tidyverse-workflow

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.2
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Now load the data. The data is in a serialized R format (.rds), which stores an R dataframe as-is, without any conversion (unlike writing to and reading from .csv).

df = readRDS("CEE377_SHIPDATA.rds")
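As an aside, here’s a small standalone sketch (using a toy dataframe, not our data) of what .rds preserves that a .csv round trip would lose:

```r
# Factors and list-columns survive saveRDS()/readRDS() exactly as they were.
toy = data.frame(grade = factor(c("A", "B", "A")))
toy$modes = list(c("car", "bus"), "bike", c("walk", "train"))  # a list-column

tmp = tempfile(fileext = ".rds")
saveRDS(toy, tmp)
toy2 = readRDS(tmp)

str(toy2, max.level = 1)  # grade is still a factor; modes is still a list
```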

Early exploration

The first step of any project is to familiarize yourself with the data.

This dataset has multiple levels (remember, it came from an RDS file, not a CSV, meaning it can contain lists, factors, etc.). Since we’re not yet familiar with the data, let’s just look at the highest level of the dataframe using max.level=1.

str(df, max.level=1)
## 'data.frame':    300 obs. of  21 variables:
##  $ id                    : int  360 577 75 58 547 341 154 192 474 51 ...
##  $ indic_tech1           : num  4 4 4 3 2 5 4 5 4 3 ...
##  $ indic_tech2           : num  4 5 4 4 4 5 5 4 4 4 ...
##  $ indic_tech3           : num  4 4 3 2 4 5 4 3 3 4 ...
##  $ indic_tech4           : num  4 5 5 4 4 5 5 5 4 4 ...
##  $ indic_pack1           : num  2 2 3 3 3 4 4 2 2 3 ...
##  $ indic_pack2           : num  4 5 4 3 5 5 2 2 5 3 ...
##  $ indic_pack3           : num  4 5 4 4 5 4 2 3 5 4 ...
##  $ indic_pack4           : num  4 5 4 4 5 5 2 4 5 4 ...
##  $ indic_envr1           : num  4 4 4 2 2 2 2 3 2 4 ...
##  $ indic_envr2           : num  4 5 4 2 1 2 2 5 4 4 ...
##  $ indic_envr3           : num  3 5 3 4 4 2 4 5 4 5 ...
##  $ indic_envr4           : num  4 5 4 2 1 5 4 5 4 4 ...
##  $ age                   : num  70 20 23.5 32.5 50 40 40 20 32.5 20 ...
##  $ ethnicity             : Factor w/ 5 levels "Asian or Pacific Islander",..: 5 1 1 5 5 5 5 4 5 5 ...
##  $ occup                 : Factor w/ 6 levels "A homemaker",..: 5 2 2 3 3 3 3 2 1 3 ...
##  $ education             : Factor w/ 4 levels "Graduate or professional degree",..: 4 4 4 1 4 3 1 2 2 2 ...
##  $ hh_size               : num  2 3 4 4 6 2 3 5 5 3 ...
##  $ typical_mode_of_travel:List of 300
##  $ hh_type               : Factor w/ 3 levels "Rural","Suburban",..: 2 3 2 3 1 2 2 3 2 2 ...
##  $ hh_income             : num  62500 7500 42500 310000 87500 20000 125000 125000 62500 7500 ...

We should also look at the first few rows to better understand the format of the dataframe.

head(df)

Multiple packages have a function called describe() which basically… describes the dataframe (counts, means, etc.). We’re going to use describe() from dlookr, but we’re not going to load the library since we’re only using a single function once (and to avoid conflicts, as will be demonstrated below). Instead, we call the function using the syntax package::function().

dlookr::describe(df)

If you look at the described variables, you’ll notice variables like ethnicity and occup are missing. That’s because those variables are categorical and dlookr::describe() doesn’t handle categorical variables (factors). For more on factors, go to: https://r4ds.had.co.nz/factors.html
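If factors are new to you, here’s a minimal standalone sketch (toy data, not our dataframe) of what they are and why describe()-style functions treat them differently:

```r
# A factor stores categorical data as integer codes plus a set of labels ("levels").
x = factor(c("Urban", "Rural", "Urban", "Suburban"))

levels(x)      # the distinct categories: "Rural" "Suburban" "Urban"
table(x)       # counts per level -- the natural summary for categorical data
is.numeric(x)  # FALSE, so numeric summaries like the mean don't apply
```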

Instead, we’re going to use another describe() function, from the prettyR package. You can also see now that had we loaded the two libraries, we would have had a conflict between the two describe() functions. So, as before, we use the function without loading the library, as prettyR::describe().

Here you’ll also see the first use of pipes (%>%). The concept of piping is that you can chain functions together without having to (1) create a variable for every step and/or (2) end up with illegibly messy code.

%>% passes an element as the FIRST parameter to the next function in line. In other words, some_variable %>% round(2) is the equivalent of round(some_variable, 2). The way I like to read %>% is as “then”: in the rounding example, we took some_variable, then we rounded it.
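Continuing the rounding example from above, here’s the equivalence in runnable form (a tiny standalone sketch):

```r
library(magrittr)  # provides %>%; it is also attached when you load tidyverse

some_variable = 3.14159

round(some_variable, 2)     # nested style
some_variable %>% round(2)  # piped style: take some_variable, then round it

# Both return 3.14
```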

Below, we will take df, then we will select() variables that are factors, then we will describe() those variables.

df %>%
    select(where(is.factor)) %>%
    prettyR::describe()
## Description of .
## 
##  Factor 
##          
## ethnicity  White Asian or Pacific Islander Black or African American
##   Count   199.00                        51                        24
##   Percent  66.33                        17                         8
##          
## ethnicity Hispanic or Latino Other
##   Count                   18  8.00
##   Percent                  6  2.67
## Mode White 
##          
## occup     Employed A student Out of work A homemaker Retired Unable to work
##   Count        168        63       28.00          18   13.00          10.00
##   Percent       56        21        9.33           6    4.33           3.33
## Mode Employed 
##          
## education Undergraduate degree High School Graduate or professional degree
##   Count                 139.00          78                           55.00
##   Percent                46.33          26                           18.33
##          
## education Trade/technical/vocational training
##   Count                                 28.00
##   Percent                                9.33
## Mode Undergraduate degree 
##          
## hh_type   Suburban Urban Rural
##   Count     178.00 80.00    42
##   Percent    59.33 26.67    14
## Mode Suburban

You can see that for factors, the descriptions focus on counts instead of statistics more relevant to numeric variables (mean, sd, etc.).

Plotting

Plotting is a must for early exploration. Since we’re focused on the indicators, let’s start getting an idea of what the response distributions look like for each. If we just plot() the indicators directly, we won’t get anything very useful.

plot(df$indic_tech1)

The figure above plots a point for the response of each respondent. While we can extract some information from it (such as most responses being 4), let’s do a proper bar chart.

First, get a frequency table() for the indicator.

table(df$indic_tech1)
## 
##   1   2   3   4   5 
##   2  19  54 165  60

Now plot that instead… and let’s do it for all indicators. R Markdown (which we’re using right now!) does a pretty nice job at handling lots of printed figures at once.

plot(table(df$indic_tech1))

plot(table(df$indic_tech2))

plot(table(df$indic_tech3))

plot(table(df$indic_tech4))

plot(table(df$indic_pack1))

plot(table(df$indic_pack2))

plot(table(df$indic_pack3))

plot(table(df$indic_pack4))

plot(table(df$indic_envr1))

plot(table(df$indic_envr2))

plot(table(df$indic_envr3))

plot(table(df$indic_envr4))

This is much better!

Plotting (advanced)

Those plots are informative, but suffer from two issues in my opinion:

  1. What if you had 1,000 indicators? 10,000? … or more? It’s good practice to write code that scales nicely.
  2. Those plots are 🤮.

For the novice in R, feel free to skip this section, but do admire the nicer plots as you scroll to the next section.

For those who stay, just note that you can run the code to any point in the chain to inspect what each line is doing.

# advanced
df %>%                                  # let's take the dataframe `df`
    select(indic_tech1:indic_envr4) %>% # select the columns from `indic_tech1` to `indic_envr4`
    pivot_longer(everything()) %>%      # pivot the data to a long format (compatible with ggplot)
    count(name, value) %>%              # get the count/frequency of every name (indicator) and value pair
    
    ggplot(aes(x = value, y = n)) +     # initiate ggplot with `x = value` and `y = n`
        geom_bar(stat='identity') +     # draw a barplot (`stat='identity'` tells `geom_bar()` to use our specified `y` variable)
        facet_wrap(~name)               # create a "window" for every indicator

# more advanced
df %>%
    select(indic_tech1:indic_envr4) %>%
    pivot_longer(everything()) %>%
    count(name, value) %>%
    mutate(category = str_extract(name, "_.*"),         # extract any text (.*) after the underscore (inclusive)
           category = str_sub(category, 2, 5)) %>%      # take only the 2nd to 5th character (removes the _ and number)
    
    ggplot(aes(x = value, y = n, fill = category)) +    # now let's also color the plots based on `category` that we just created
        geom_bar(stat='identity') +
        facet_wrap(~name) +
        theme_bw()                                      # ... and we make our plots less 🤮 by using a nicer theme

Back to more basic plotting

You should also experiment with plots of other variables in the dataset to get familiar with them. Here are a few examples. Notice the different approaches for continuous and categorical data.

hist(df$age)

hist(df$hh_income)

plot(table(df$ethnicity))

Correlograms

corrplot is a great library for correlograms and another very important tool for your second stage of exploration.

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
library(polycor)
## Warning: package 'polycor' was built under R version 4.1.2

typical_mode_of_travel is of type list, which is messy to deal with, so let’s remove it from the dataframe for now. Elisa already taught you how to create dummy variables, but we’re not going to do that today.

df = df %>%
    select(-typical_mode_of_travel)

Remember how we have categorical variables? Well… you can’t really get the correlation between, say, ethnicity and age the same way you would between age and income. What even is the correlation between White and 35 years old? We’re not going to dive into the details of how this is achieved; just know that R has you covered.

The function hetcor() from polycor can handle any type of correlation (continuous-continuous, continuous-categorical, categorical-categorical) automagically.

There are two things we want to be aware of. First, we don’t want to measure correlation with the column id, as that would be spurious. Second, we have a few missing observations in hh_income, so we want to set the option use = "pairwise.complete.obs" so that each correlation is only calculated across complete pairs (and doesn’t throw an error… hetcor() is really verbose about errors).

We will then pull the correlation values from the hetcor object and save them to cor.mat.vals.

cor.mat = df %>%
    select(-id) %>%
    hetcor(use = "pairwise.complete.obs")
## Warning in FUN(X[[i]], ...): polychoric correlation between variables ethnicity and education produced warnings:
##    NaNs produced
##    NaNs produced
cor.mat.vals = cor.mat$correlations

print(cor.mat.vals)
##             indic_tech1  indic_tech2  indic_tech3 indic_tech4  indic_pack1
## indic_tech1  1.00000000  0.588831941  0.447521608  0.31948743  0.181949635
## indic_tech2  0.58883194  1.000000000  0.611259518  0.29472088  0.214732776
## indic_tech3  0.44752161  0.611259518  1.000000000  0.23574305  0.222736134
## indic_tech4  0.31948743  0.294720883  0.235743049  1.00000000  0.051672637
## indic_pack1  0.18194964  0.214732776  0.222736134  0.05167264  1.000000000
## indic_pack2  0.05369822  0.096164647  0.087152241  0.06881749 -0.002841361
## indic_pack3  0.03227011  0.101311468  0.102162560  0.03417269  0.053093791
## indic_pack4  0.17975141  0.259669691  0.169444185  0.16045295 -0.075421230
## indic_envr1  0.09128080  0.164506379  0.137330847  0.04916577  0.204489499
## indic_envr2  0.12948287  0.115219271  0.132369497  0.06021463  0.054095856
## indic_envr3  0.08140807  0.199452262  0.151351781  0.07469534  0.030448584
## indic_envr4  0.23273764  0.189801864  0.151369643  0.21289420  0.026907020
## age         -0.15876302 -0.147619366 -0.153034684 -0.21297478 -0.047238492
## ethnicity   -0.12841369 -0.131702995 -0.093326209 -0.07505621  0.069280301
## occup       -0.09786140 -0.064914488 -0.071573128 -0.14539719 -0.081324830
## education   -0.08198512  0.018903777  0.001403611 -0.02537182 -0.130272842
## hh_size      0.07487267  0.064788380  0.077497761  0.06050230 -0.010196959
## hh_type      0.02073135  0.002977609  0.069979587  0.03284566 -0.043381783
## hh_income    0.14632059  0.117185841  0.144151340  0.02380804  0.162056091
##              indic_pack2  indic_pack3  indic_pack4 indic_envr1 indic_envr2
## indic_tech1  0.053698216  0.032270108  0.179751406  0.09128080  0.12948287
## indic_tech2  0.096164647  0.101311468  0.259669691  0.16450638  0.11521927
## indic_tech3  0.087152241  0.102162560  0.169444185  0.13733085  0.13236950
## indic_tech4  0.068817489  0.034172686  0.160452955  0.04916577  0.06021463
## indic_pack1 -0.002841361  0.053093791 -0.075421230  0.20448950  0.05409586
## indic_pack2  1.000000000  0.469915917  0.352406192  0.04311226  0.02199200
## indic_pack3  0.469915917  1.000000000  0.350514527  0.11239166  0.04884682
## indic_pack4  0.352406192  0.350514527  1.000000000  0.02342612  0.04620530
## indic_envr1  0.043112258  0.112391660  0.023426125  1.00000000  0.55275104
## indic_envr2  0.021992002  0.048846819  0.046205305  0.55275104  1.00000000
## indic_envr3  0.121543438  0.237630239  0.303421928  0.24069106  0.19431004
## indic_envr4  0.146741358  0.130749614 -0.003522792  0.43674625  0.47374619
## age         -0.039176397 -0.049273591  0.015458997 -0.22798746 -0.11305847
## ethnicity   -0.050380662 -0.002410978 -0.113081586 -0.15600541 -0.04616791
## occup        0.004556612 -0.093337044  0.035929371 -0.10735247 -0.01548385
## education   -0.001618468  0.048413673 -0.008582287  0.00708985 -0.02999150
## hh_size      0.090571556  0.082307334 -0.001595527  0.02837582 -0.02519878
## hh_type     -0.101028818 -0.036027214 -0.047401576 -0.06901130  0.02607995
## hh_income    0.021110183 -0.004602605 -0.170745685  0.08825433  0.01809673
##             indic_envr3  indic_envr4         age    ethnicity        occup
## indic_tech1  0.08140807  0.232737635 -0.15876302 -0.128413687 -0.097861405
## indic_tech2  0.19945226  0.189801864 -0.14761937 -0.131702995 -0.064914488
## indic_tech3  0.15135178  0.151369643 -0.15303468 -0.093326209 -0.071573128
## indic_tech4  0.07469534  0.212894201 -0.21297478 -0.075056214 -0.145397185
## indic_pack1  0.03044858  0.026907020 -0.04723849  0.069280301 -0.081324830
## indic_pack2  0.12154344  0.146741358 -0.03917640 -0.050380662  0.004556612
## indic_pack3  0.23763024  0.130749614 -0.04927359 -0.002410978 -0.093337044
## indic_pack4  0.30342193 -0.003522792  0.01545900 -0.113081586  0.035929371
## indic_envr1  0.24069106  0.436746254 -0.22798746 -0.156005411 -0.107352473
## indic_envr2  0.19431004  0.473746192 -0.11305847 -0.046167908 -0.015483852
## indic_envr3  1.00000000  0.206540139 -0.15340400 -0.109137924 -0.121448192
## indic_envr4  0.20654014  1.000000000 -0.21123166  0.001675920 -0.014525843
## age         -0.15340400 -0.211231660  1.00000000  0.332826385  0.383627783
## ethnicity   -0.10913792  0.001675920  0.33282639  1.000000000  0.262644734
## occup       -0.12144819 -0.014525843  0.38362778  0.262644734  1.000000000
## education   -0.07803055  0.027257034 -0.13273417 -0.158198201 -0.134291639
## hh_size      0.03821027 -0.004811397 -0.28794720 -0.186132449 -0.339923148
## hh_type     -0.01293599  0.078865154  0.01703296 -0.283533287  0.049606456
## hh_income    0.02362338  0.009670883 -0.09589640 -0.063935259 -0.253537174
##                education      hh_size      hh_type    hh_income
## indic_tech1 -0.081985116  0.074872673  0.020731355  0.146320592
## indic_tech2  0.018903777  0.064788380  0.002977609  0.117185841
## indic_tech3  0.001403611  0.077497761  0.069979587  0.144151340
## indic_tech4 -0.025371817  0.060502298  0.032845660  0.023808040
## indic_pack1 -0.130272842 -0.010196959 -0.043381783  0.162056091
## indic_pack2 -0.001618468  0.090571556 -0.101028818  0.021110183
## indic_pack3  0.048413673  0.082307334 -0.036027214 -0.004602605
## indic_pack4 -0.008582287 -0.001595527 -0.047401576 -0.170745685
## indic_envr1  0.007089850  0.028375824 -0.069011303  0.088254329
## indic_envr2 -0.029991498 -0.025198782  0.026079954  0.018096726
## indic_envr3 -0.078030553  0.038210266 -0.012935991  0.023623377
## indic_envr4  0.027257034 -0.004811397  0.078865154  0.009670883
## age         -0.132734166 -0.287947204  0.017032965 -0.095896397
## ethnicity   -0.158198201 -0.186132449 -0.283533287 -0.063935259
## occup       -0.134291639 -0.339923148  0.049606456 -0.253537174
## education    1.000000000  0.078632221  0.024202630 -0.013641899
## hh_size      0.078632221  1.000000000 -0.144253064  0.255828102
## hh_type      0.024202630 -0.144253064  1.000000000  0.071197550
## hh_income   -0.013641899  0.255828102  0.071197550  1.000000000

We can now pass cor.mat.vals to corrplot() for some really nice plots.

corrplot(cor.mat.vals)

corrplot(cor.mat.vals, type="upper", order="hclust")

corrplot(cor.mat.vals, order="hclust", addrect = 9)

corrplot.mixed(cor.mat.vals, tl.pos = "lt")

Advanced corrplot()

For the faint of heart, feel free to move to the next section.

For the brave 💪 , here’s the deal. Correlations, like almost anything in statistics, can be insignificant. I’m sure you still remember your STAT 101, so let’s do some manual labor.

We’re going to:

  1. pull the standard errors from our hetcor object
  2. get the t-statistic by dividing each correlation by its standard error
  3. get the p-value using pt() (our sample size is large enough not to worry about the exact degrees of freedom)
  4. pass that info to corrplot()

Much nicer, yea?

# advanced
cor.mat.std = cor.mat$std.err
cor.mat.tstat = cor.mat.vals/cor.mat.std
cor.mat.pval = 2*pt(abs(cor.mat.tstat), df=Inf, lower.tail=FALSE)  # abs() gives a proper two-sided p-value for negative correlations too

corrplot(cor.mat.vals, order="hclust", addrect = 9, p.mat = cor.mat.pval, sig.level = 0.05)

corrplot(cor.mat.vals, order="hclust", addrect = 9, p.mat = cor.mat.pval, sig.level = 0.05, insig="blank", diag=FALSE)

Principal Component Analysis

Source: Jeroen Boeye, DataCamp

We will run PCA on our 12 indicators. While our indicators are on similar scales, it is good practice to always scale your variables before they go into PCA; otherwise you will get erroneous results (especially in cases where scales really differ, like mass vs. height).
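To see why scaling matters, here’s a quick sketch with made-up mass/height data (synthetic numbers, not the workshop data):

```r
set.seed(1)
toy = data.frame(mass_g   = rnorm(100, mean = 70000, sd = 15000),  # large scale
                 height_m = rnorm(100, mean = 1.7,   sd = 0.1))    # tiny scale

p_raw    = prcomp(toy, center = TRUE, scale. = FALSE)
p_scaled = prcomp(toy, center = TRUE, scale. = TRUE)

# Unscaled, PC1 is dominated by mass_g purely because its variance is larger:
summary(p_raw)$importance[2, 1]     # ~1.00 of the variance lands on PC1
# Scaled, the two (independent) variables contribute comparably:
summary(p_scaled)$importance[2, 1]  # ~0.5
```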

df.indic = df %>%
    select(indic_tech1:indic_envr4)

pcomp.mat = df.indic %>%
    prcomp(center = TRUE, scale. = TRUE)

print(pcomp.mat)    # shows the loadings
## Standard deviations (1, .., p=12):
##  [1] 1.7420835 1.3071122 1.2973273 1.0216382 0.9557096 0.8738682 0.7953705
##  [8] 0.7426541 0.7122670 0.6564703 0.6369985 0.5697189
## 
## Rotation (n x k) = (12 x 12):
##                    PC1         PC2        PC3         PC4         PC5
## indic_tech1 -0.3628915  0.06340829 -0.3670750  0.11433251 -0.07803006
## indic_tech2 -0.4151267  0.12100072 -0.3320940 -0.02504121  0.12848573
## indic_tech3 -0.3711037  0.09050247 -0.3187055 -0.12497904  0.10261559
## indic_tech4 -0.2514210  0.07798001 -0.2053699  0.45328374 -0.31965569
## indic_pack1 -0.1636596 -0.11090041 -0.2055662 -0.76019802 -0.13514068
## indic_pack2 -0.1929340  0.36234152  0.3669573 -0.10087654 -0.44171391
## indic_pack3 -0.2138255  0.32642444  0.4081716 -0.24427947 -0.19974891
## indic_pack4 -0.2476521  0.42677216  0.2138707  0.14476723  0.27223238
## indic_envr1 -0.2883997 -0.44621274  0.2377464 -0.16165517  0.03911598
## indic_envr2 -0.2719006 -0.46190386  0.2364802  0.09802108  0.02465941
## indic_envr3 -0.2599730  0.03895505  0.2687675  0.01546546  0.67029025
## indic_envr4 -0.3151157 -0.34757802  0.1930737  0.24602497 -0.29305557
##                      PC6         PC7         PC8         PC9         PC10
## indic_tech1  0.185098284 -0.01058000  0.60478944 -0.22369128  0.037685845
## indic_tech2  0.201706545 -0.05284361 -0.02765019  0.00341034  0.290264443
## indic_tech3  0.282642717 -0.15215600 -0.60415874  0.14099338 -0.230582426
## indic_tech4 -0.687358386  0.12318936 -0.27172071 -0.01749997  0.008787226
## indic_pack1 -0.420692763  0.13342010  0.20639752  0.10462702 -0.227195346
## indic_pack2  0.150908944 -0.04600233  0.05720236  0.61582429  0.085834842
## indic_pack3 -0.011186437 -0.19889778 -0.16652267 -0.70964545 -0.051526292
## indic_pack4 -0.004237076  0.60591577  0.15836368  0.01220839 -0.087821626
## indic_envr1 -0.020900116  0.25910687 -0.16108129 -0.03462005  0.691706793
## indic_envr2  0.190922724  0.34682568 -0.07032155 -0.02530810 -0.546005166
## indic_envr3 -0.367351463 -0.42351001  0.11541054  0.17793154 -0.050116634
## indic_envr4  0.031148681 -0.40665959  0.22744774  0.04831613 -0.112411352
##                    PC11         PC12
## indic_tech1 -0.34243240 -0.376195310
## indic_tech2  0.05443464  0.744403036
## indic_tech3  0.11516223 -0.414743905
## indic_tech4 -0.13684149  0.017479739
## indic_pack1  0.14431082  0.041493440
## indic_pack2 -0.27681680  0.011984629
## indic_pack3 -0.07494559  0.047601123
## indic_pack4  0.45278138 -0.103061551
## indic_envr1 -0.03222842 -0.243553138
## indic_envr2 -0.36009241  0.237618324
## indic_envr3 -0.21414154 -0.038432395
## indic_envr4  0.60261613 -0.003553337
summary(pcomp.mat)  # shows summary statistics
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.7421 1.3071 1.2973 1.02164 0.95571 0.87387 0.79537
## Proportion of Variance 0.2529 0.1424 0.1403 0.08698 0.07612 0.06364 0.05272
## Cumulative Proportion  0.2529 0.3953 0.5355 0.62252 0.69863 0.76227 0.81499
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.74265 0.71227 0.65647 0.63700 0.56972
## Proportion of Variance 0.04596 0.04228 0.03591 0.03381 0.02705
## Cumulative Proportion  0.86095 0.90322 0.93914 0.97295 1.00000

12 components are… a lot! In fact, you’ll always get as many components as there are variables. How is this dimensionality reduction, then? The reduction comes from the fact that not all 12 components are equally important: we want to keep only the components that explain as much of the variance in our data as possible. Conveniently, summary() outputs this information for us.

Let’s visualize this. First, let’s look at how to pull the data from summary(pcomp.mat) using str(). A quarter of the way from the bottom, we can see that the importance information is stored in summary(pcomp.mat)$importance.

str(summary(pcomp.mat))
## List of 6
##  $ sdev      : num [1:12] 1.742 1.307 1.297 1.022 0.956 ...
##  $ rotation  : num [1:12, 1:12] -0.363 -0.415 -0.371 -0.251 -0.164 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:12] "indic_tech1" "indic_tech2" "indic_tech3" "indic_tech4" ...
##   .. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
##  $ center    : Named num [1:12] 3.87 4.12 3.29 4.56 2.87 ...
##   ..- attr(*, "names")= chr [1:12] "indic_tech1" "indic_tech2" "indic_tech3" "indic_tech4" ...
##  $ scale     : Named num [1:12] 0.824 0.906 1.035 0.713 1.113 ...
##   ..- attr(*, "names")= chr [1:12] "indic_tech1" "indic_tech2" "indic_tech3" "indic_tech4" ...
##  $ x         : num [1:300, 1:12] 0.0111 -2.6093 -0.1299 2.6068 1.8813 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
##  $ importance: num [1:3, 1:12] 1.742 0.253 0.253 1.307 0.142 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "Standard deviation" "Proportion of Variance" "Cumulative Proportion"
##   .. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "summary.prcomp"

The first row in importance gives us the standard deviation, so let’s square that to get the variance.

summary(pcomp.mat)$importance[1,]^2 %>%
    plot(type="b", col="green4", lwd=2, xlab="PC", ylab="variance")

While traditionally people look at the variance plot, I find importance (i.e. proportion of variance) more intuitive, so let’s look at that. Notice that the first 4 principal components alone explain ~60% of the variance among these 12 indicators?

summary(pcomp.mat)$importance[2,] %>%
    plot(type="b", col="red", lwd=2, xlab="PC", ylab="importance")

summary(pcomp.mat)$importance[3,] %>%
    plot(type="b", col="royalblue", lwd=2, xlab="PC", ylab="cumulative importance")

Should’ve told you that you could have just used screeplot()… but now you know how to extract data from objects!

screeplot(pcomp.mat, type="lines", npcs=ncol(df.indic))

Visualizing PCA

A way to visualize PCA is to create biplots with biplot(). These plots show you the variable loadings (the vectors) along with your data on the 2D plane created by a pair of PCs. Here’s an example below.

biplot(pcomp.mat)

Doesn’t make much sense, does it?

Let’s use ggbiplot to plot the biplot instead. This plot does provide us with a bit of extra information (explained variance) and nicer graphics… but there’s still a lot going on.

library(ggbiplot)
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 4.1.3
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
## Loading required package: scales
## Warning: package 'scales' was built under R version 4.1.3
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
## Loading required package: grid
ggbiplot(pcomp.mat)

Let’s go through PCA with a smaller dataset, taking just 3 of the 12 indicators and working on them instead.

pcomp.mat.trunc = df %>%
    select(indic_tech1:indic_tech3) %>%
    prcomp(center = TRUE, scale. = TRUE)

print(pcomp.mat.trunc)
## Standard deviations (1, .., p=3):
## [1] 1.4496274 0.7437608 0.5877077
## 
## Rotation (n x k) = (3 x 3):
##                    PC1         PC2        PC3
## indic_tech1 -0.5556781 -0.73543052  0.3877678
## indic_tech2 -0.6103638  0.04417084 -0.7908887
## indic_tech3 -0.5645157  0.67615892  0.4734249
summary(pcomp.mat.trunc)
## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.4496 0.7438 0.5877
## Proportion of Variance 0.7005 0.1844 0.1151
## Cumulative Proportion  0.7005 0.8849 1.0000
ggbiplot(pcomp.mat.trunc)

The above grid looks cleaner, but it’s still a bunch of random dots. If we look at PC1 and PC3 instead, though, we see some order emerging from the noise.

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(1,3))

Let’s add colors and… this now makes more sense. Look at how the principal component for each indicator points in the direction of the color gradient and along the ellipses.

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(1,3), groups=df$indic_tech1)

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(1,3), groups=df$indic_tech2)

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(1,3), groups=df$indic_tech3)

We can now also see this in the earlier PC1-PC2 plot.

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(1,2), groups=df$indic_tech1)

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(1,2), groups=df$indic_tech2)

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(1,2), groups=df$indic_tech3)

… and PC2-PC3.

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(2,3), groups=df$indic_tech1)

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(2,3), groups=df$indic_tech2)

ggbiplot(pcomp.mat.trunc, ellipse=TRUE, choices=c(2,3), groups=df$indic_tech3)

If we go back to our earlier PCA with 12 indicators, we can now visualize and understand what’s happening too.

ggbiplot(pcomp.mat, ellipse=TRUE, choices=c(1,2), groups=df$indic_envr1)

Extracting more info from PCA

It’s easy to extract the eigenvalues from our pcomp.mat manually, as below (just square the standard deviation values).

# eigenvalues, manually
pcomp.mat$sdev^2
##  [1] 3.0348549 1.7085424 1.6830580 1.0437446 0.9133809 0.7636456 0.6326142
##  [8] 0.5515351 0.5073243 0.4309533 0.4057671 0.3245797

… or you can use FactoMineR, which does that for you and provides some more information.

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.3
pcomp.fmr = df.indic %>%
    PCA(scale.unit=TRUE, ncp=ncol(df.indic))

pcomp.fmr$eig               # eigen value
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   3.0348549              25.290457                          25.29046
## comp 2   1.7085424              14.237853                          39.52831
## comp 3   1.6830580              14.025484                          53.55379
## comp 4   1.0437446               8.697872                          62.25167
## comp 5   0.9133809               7.611508                          69.86317
## comp 6   0.7636456               6.363713                          76.22689
## comp 7   0.6326142               5.271785                          81.49867
## comp 8   0.5515351               4.596126                          86.09480
## comp 9   0.5073243               4.227702                          90.32250
## comp 10  0.4309533               3.591278                          93.91378
## comp 11  0.4057671               3.381392                          97.29517
## comp 12  0.3245797               2.704831                         100.00000
head(pcomp.fmr$ind$coord)   # PC scores
##        Dim.1      Dim.2        Dim.3       Dim.4      Dim.5      Dim.6
## 1 -0.0110790 -0.5065151  0.166701915  0.02175366  0.2053319 -1.3911214
## 2  2.6136698  0.3956022  1.538806249 -0.77501400 -0.2051026 -0.3070101
## 3  0.1301143 -0.5843495  0.001776618 -0.05204866  0.8752614  0.2264773
## 4 -2.6111766  1.1135644 -0.340856115  0.80767682 -1.1037339  0.6448778
## 5 -1.8844613  3.6884963  0.615104145  1.77318812 -0.8543659  0.2161761
## 6  1.4516476  2.0385783 -1.873562675  0.33788039  1.8722497 -0.6091732
##          Dim.7       Dim.8       Dim.9     Dim.10     Dim.11      Dim.12
## 1  0.003287645 -0.59148801 -0.36898112 -0.1994285  0.2433591  0.43750339
## 2 -0.135256812 -0.52847147 -0.07248417  0.2557436  0.3060869 -0.47164927
## 3  0.443644434 -0.20268946 -0.43591327 -0.2305733  0.4172266 -0.02583672
## 4 -0.305349777  0.09161305 -0.75749171 -0.1798158  0.1937597 -0.83126638
## 5  0.010380806 -1.77914361  0.39979125 -0.3092759 -0.1833253 -0.19789779
## 6 -0.101585032  0.33478485  0.34141102  0.4193701 -1.3986974  0.45878190
pcomp.fmr$var$coord         # correlations between variables and PCs
##                 Dim.1       Dim.2      Dim.3       Dim.4       Dim.5
## indic_tech1 0.6321874  0.08288176 -0.4762164 -0.11680646  0.07457409
## indic_tech2 0.7231854  0.15816152 -0.4308346  0.02558306 -0.12279505
## indic_tech3 0.6464937  0.11829688 -0.4134654  0.12768336 -0.09807071
## indic_tech4 0.4379964  0.10192863 -0.2664319 -0.46309198  0.30549802
## indic_pack1 0.2851087 -0.14495928 -0.2666867  0.77664734  0.12915525
## indic_pack2 0.3361071  0.47362103  0.4760637  0.10305932  0.42215024
## indic_pack3 0.3725019  0.42667338  0.5295322  0.24956523  0.19090196
## indic_pack4 0.4314307  0.55783911  0.2774603 -0.14789973 -0.26017511
## indic_envr1 0.5024164 -0.58325013  0.3084348  0.16515310 -0.03738352
## indic_envr2 0.4736735 -0.60376019  0.3067922 -0.10014208 -0.02356724
## indic_envr3 0.4528946  0.05091862  0.3486794 -0.01580011 -0.64060286
## indic_envr4 0.5489578 -0.45432348  0.2504798 -0.25134850  0.28007604
##                    Dim.6        Dim.7       Dim.8        Dim.9       Dim.10
## indic_tech1 -0.161751496 -0.008415016  0.44914936 -0.159327917 -0.024739639
## indic_tech2 -0.176264926 -0.042030250 -0.02053453  0.002429073 -0.190549997
## indic_tech3 -0.246992470 -0.121020387 -0.44868098  0.100424927  0.151370524
## indic_tech4  0.600660605  0.097981177 -0.20179450 -0.012464654 -0.005768553
## indic_pack1  0.367630008  0.106118405  0.15328197  0.074522376  0.149147006
## indic_pack2 -0.131874520 -0.036588891  0.04248157  0.438631310 -0.056348028
## indic_pack3  0.009775471 -0.158197416 -0.12366874 -0.505457025  0.033825483
## indic_pack4  0.003702646  0.481927499  0.11760943  0.008695635  0.057652292
## indic_envr1  0.018263946  0.206085946 -0.11962769 -0.024658715 -0.454084993
## indic_envr2 -0.166841288  0.275854900 -0.05222459 -0.018026123  0.358436197
## indic_envr3  0.321016745 -0.336847352  0.08571012  0.126734764  0.032900084
## indic_envr4 -0.027219840 -0.323445023  0.16891500  0.034413985  0.073794719
##                  Dim.11       Dim.12
## indic_tech1  0.21812892  0.214325593
## indic_tech2 -0.03467479 -0.424100507
## indic_tech3 -0.07335817  0.236287457
## indic_tech4  0.08716782 -0.009958539
## indic_pack1 -0.09192578 -0.023639598
## indic_pack2  0.17633189 -0.006827870
## indic_pack3  0.04774023 -0.027119261
## indic_pack4 -0.28842106  0.058716117
## indic_envr1  0.02052945  0.138756835
## indic_envr2  0.22937832 -0.135375659
## indic_envr3  0.13640784  0.021895663
## indic_envr4 -0.38386557  0.002024404
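The `var$coord` table above is dense; one practical way to read it is to rank the variables by the absolute value of their correlation with each component. Below is a minimal sketch of that idea using base R's `prcomp()` on the built-in `mtcars` data as a stand-in, since `pcomp.fmr$var$coord` holds the same kind of information for the ship data.

```r
# Sketch: ranking variables by their correlation with each PC,
# using prcomp() on mtcars as a stand-in for pcomp.fmr$var$coord.
pca = prcomp(mtcars, scale. = TRUE)

# With standardized inputs, the correlation between a variable and a PC
# equals its loading (rotation) times that PC's standard deviation.
var_coord = sweep(pca$rotation, 2, pca$sdev, "*")

# The 3 variables most strongly associated with PC1, by absolute correlation
head(sort(abs(var_coord[, "PC1"]), decreasing = TRUE), 3)
```

On the ship data, the equivalent would be sorting the columns of `pcomp.fmr$var$coord` the same way.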

Screeplots (advanced)

With the library nFactors, we can get a more advanced screeplot to decide on the number of principal components that we want to use (and in the sections below, the number of factors in factor analysis).

Run the code below.

library(nFactors)
## Loading required package: lattice
## 
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':
## 
##     parallel
ev = eigen(hetcor(df.indic, use = "pairwise.complete.obs")$correlations)
ap = parallel(subject=nrow(df.indic), var=ncol(df.indic), rep=100, cent=.05)
nS = nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plot(nS)

As you can see in the plot, nScree uses four methods to determine the best number of components/factors, and the four suggestions need not agree. Here, they point to keeping 3 or 4.

For more on this and the details for each method, check out ?nScree.
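To see what the parallel-analysis criterion used by `nScree` is doing under the hood, here is a minimal base-R sketch of the idea: compare the observed eigenvalues against eigenvalues from pure-noise data of the same dimensions, and retain components that beat the noise benchmark. (This uses `mtcars` as a stand-in for `df.indic`.)

```r
# Sketch of the idea behind parallel analysis, in base R only.
set.seed(42)
X = scale(mtcars)                      # stand-in for df.indic
obs_eig = eigen(cor(X))$values

# Average eigenvalues over 100 random (noise-only) datasets
# with the same number of rows and columns as X
noise_eig = rowMeans(replicate(100, {
  Z = matrix(rnorm(nrow(X) * ncol(X)), nrow(X))
  eigen(cor(Z))$values
}))

# Components to keep under the parallel-analysis criterion
sum(obs_eig > noise_eig)
```

`nScree` combines this criterion with three others (Kaiser rule, acceleration factor, optimal coordinates), which is why it reports more than one suggestion.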

Exploratory Factor Analysis

Exploratory factor analysis (EFA) is a close cousin of PCA. Unlike PCA, here you specify the number of factors, and the indicators are grouped across them. Let’s start with 4 factors, following our scree plot above. We’ll use a "promax" rotation; refer to ?factanal or your course notes for a discussion of rotation.

efa = factanal(df.indic, 4, rotation="promax")
print(efa)
## 
## Call:
## factanal(x = df.indic, factors = 4, rotation = "promax")
## 
## Uniquenesses:
## indic_tech1 indic_tech2 indic_tech3 indic_tech4 indic_pack1 indic_pack2 
##       0.523       0.237       0.523       0.843       0.891       0.579 
## indic_pack3 indic_pack4 indic_envr1 indic_envr2 indic_envr3 indic_envr4 
##       0.523       0.631       0.063       0.615       0.825       0.005 
## 
## Loadings:
##             Factor1 Factor2 Factor3 Factor4
## indic_tech1  0.691                   0.109 
## indic_tech2  0.884                         
## indic_tech3  0.697                         
## indic_tech4  0.339  -0.107           0.174 
## indic_pack1  0.258   0.234          -0.144 
## indic_pack2                  0.656         
## indic_pack3                  0.701         
## indic_pack4  0.184           0.550  -0.138 
## indic_envr1          1.001                 
## indic_envr2          0.456           0.254 
## indic_envr3          0.173   0.288         
## indic_envr4                          1.014 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      1.980   1.323   1.318   1.191
## Proportion Var   0.165   0.110   0.110   0.099
## Cumulative Var   0.165   0.275   0.385   0.484
## 
## Factor Correlations:
##         Factor1 Factor2 Factor3 Factor4
## Factor1   1.000  -0.520  -0.281   0.166
## Factor2  -0.520   1.000   0.242  -0.147
## Factor3  -0.281   0.242   1.000  -0.274
## Factor4   0.166  -0.147  -0.274   1.000
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 39.05 on 24 degrees of freedom.
## The p-value is 0.027

You can see how our indicators are sorting themselves out into groups above. Don’t be tricked though! This is only because the print() method for factanal objects hides loadings below a default cutoff of 0.1. Let’s remove the cutoff to see the full results.

print(efa, cutoff = 0)
## 
## Call:
## factanal(x = df.indic, factors = 4, rotation = "promax")
## 
## Uniquenesses:
## indic_tech1 indic_tech2 indic_tech3 indic_tech4 indic_pack1 indic_pack2 
##       0.523       0.237       0.523       0.843       0.891       0.579 
## indic_pack3 indic_pack4 indic_envr1 indic_envr2 indic_envr3 indic_envr4 
##       0.523       0.631       0.063       0.615       0.825       0.005 
## 
## Loadings:
##             Factor1 Factor2 Factor3 Factor4
## indic_tech1  0.691  -0.092  -0.071   0.109 
## indic_tech2  0.884   0.009  -0.001  -0.048 
## indic_tech3  0.697   0.020   0.001  -0.042 
## indic_tech4  0.339  -0.107   0.017   0.174 
## indic_pack1  0.258   0.234  -0.079  -0.144 
## indic_pack2 -0.067  -0.068   0.656   0.085 
## indic_pack3 -0.080   0.037   0.701   0.012 
## indic_pack4  0.184  -0.018   0.550  -0.138 
## indic_envr1 -0.024   1.001  -0.008  -0.054 
## indic_envr2 -0.006   0.456  -0.026   0.254 
## indic_envr3  0.090   0.173   0.288   0.047 
## indic_envr4 -0.018  -0.027   0.013   1.014 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      1.980   1.323   1.318   1.191
## Proportion Var   0.165   0.110   0.110   0.099
## Cumulative Var   0.165   0.275   0.385   0.484
## 
## Factor Correlations:
##         Factor1 Factor2 Factor3 Factor4
## Factor1   1.000  -0.520  -0.281   0.166
## Factor2  -0.520   1.000   0.242  -0.147
## Factor3  -0.281   0.242   1.000  -0.274
## Factor4   0.166  -0.147  -0.274   1.000
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 39.05 on 24 degrees of freedom.
## The p-value is 0.027

Cutoffs are useful, though: very small loadings are mostly noise, and hiding them makes the proper groupings easier to spot. Let’s use a higher cutoff of 0.3, which is common good practice.

print(efa, cutoff = 0.3)
## 
## Call:
## factanal(x = df.indic, factors = 4, rotation = "promax")
## 
## Uniquenesses:
## indic_tech1 indic_tech2 indic_tech3 indic_tech4 indic_pack1 indic_pack2 
##       0.523       0.237       0.523       0.843       0.891       0.579 
## indic_pack3 indic_pack4 indic_envr1 indic_envr2 indic_envr3 indic_envr4 
##       0.523       0.631       0.063       0.615       0.825       0.005 
## 
## Loadings:
##             Factor1 Factor2 Factor3 Factor4
## indic_tech1  0.691                         
## indic_tech2  0.884                         
## indic_tech3  0.697                         
## indic_tech4  0.339                         
## indic_pack1                                
## indic_pack2                  0.656         
## indic_pack3                  0.701         
## indic_pack4                  0.550         
## indic_envr1          1.001                 
## indic_envr2          0.456                 
## indic_envr3                                
## indic_envr4                          1.014 
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      1.980   1.323   1.318   1.191
## Proportion Var   0.165   0.110   0.110   0.099
## Cumulative Var   0.165   0.275   0.385   0.484
## 
## Factor Correlations:
##         Factor1 Factor2 Factor3 Factor4
## Factor1   1.000  -0.520  -0.281   0.166
## Factor2  -0.520   1.000   0.242  -0.147
## Factor3  -0.281   0.242   1.000  -0.274
## Factor4   0.166  -0.147  -0.274   1.000
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 39.05 on 24 degrees of freedom.
## The p-value is 0.027

The indic_tech and indic_pack indicators group up with each other quite nicely. The indic_envr variables are a bit problematic: indic_envr4 loads on a factor by itself, and we don’t want single-indicator factors because they neither reduce dimensionality nor add much explanation to a latent variable. Factors with only 2 indicators are also discouraged (aim for at least 3).

Let’s run an EFA with 3 factors.

efa = factanal(df.indic, 3, rotation="promax")
print(efa, cutoff = 0.3)
## 
## Call:
## factanal(x = df.indic, factors = 3, rotation = "promax")
## 
## Uniquenesses:
## indic_tech1 indic_tech2 indic_tech3 indic_tech4 indic_pack1 indic_pack2 
##       0.537       0.235       0.525       0.865       0.925       0.581 
## indic_pack3 indic_pack4 indic_envr1 indic_envr2 indic_envr3 indic_envr4 
##       0.512       0.661       0.462       0.431       0.819       0.605 
## 
## Loadings:
##             Factor1 Factor2 Factor3
## indic_tech1  0.698                 
## indic_tech2  0.892                 
## indic_tech3  0.693                 
## indic_tech4  0.354                 
## indic_pack1                        
## indic_pack2                  0.668 
## indic_pack3                  0.715 
## indic_pack4                  0.531 
## indic_envr1          0.746         
## indic_envr2          0.783         
## indic_envr3                        
## indic_envr4          0.597         
## 
##                Factor1 Factor2 Factor3
## SS loadings      2.003   1.599   1.334
## Proportion Var   0.167   0.133   0.111
## Cumulative Var   0.167   0.300   0.411
## 
## Factor Correlations:
##         Factor1 Factor2 Factor3
## Factor1   1.000   0.333   0.291
## Factor2   0.333   1.000   0.223
## Factor3   0.291   0.223   1.000
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 72.74 on 33 degrees of freedom.
## The p-value is 8.11e-05

This looks much better. We have 3 factors, one each for tech, envr and pack. It’s fine not to use every indicator (that’s why we collect many of them).
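Once the factor structure looks good, `factanal()` can also return per-observation factor scores via its `scores` argument, which is useful if you want to carry the latent variables into later models. A self-contained sketch on simulated data (since `df.indic` comes from the RDS file; on the real data you would call `factanal(df.indic, 3, rotation = "promax", scores = "regression")`):

```r
# Sketch: extracting factor scores from factanal() on simulated data
# with two known latent factors (f1 drives a1-a3, f2 drives b1-b3).
set.seed(1)
n = 300
f1 = rnorm(n); f2 = rnorm(n)
sim = data.frame(
  a1 = f1 + rnorm(n, sd = 0.5), a2 = f1 + rnorm(n, sd = 0.5),
  a3 = f1 + rnorm(n, sd = 0.5),
  b1 = f2 + rnorm(n, sd = 0.5), b2 = f2 + rnorm(n, sd = 0.5),
  b3 = f2 + rnorm(n, sd = 0.5))

efa_sim = factanal(sim, factors = 2, rotation = "promax",
                   scores = "regression")
head(efa_sim$scores)   # one row per observation, one column per factor
```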

Confirmatory Factor Analysis

Remember how in EFA we applied a printing cutoff so we could more easily identify which indicators should be placed under which factors? By hiding information that way, we could be drawing erroneous conclusions. CFA addresses this: we explicitly assign the indicators to the factors we decided they belong to, then test whether the loadings are significant and whether the model fits well.

We’re using lavaan, a very powerful (and well documented!) package for CFA (and down the line, structural equation modeling). For more on lavaan, check out: https://lavaan.ugent.be/tutorial/index.html

Below, we test the construct from the EFA we just ran. All loadings are significant! What else do you find interesting in this model?

library(lavaan)
## Warning: package 'lavaan' was built under R version 4.1.2
## This is lavaan 0.6-10
## lavaan is FREE software! Please report any bugs.
model = "
    tech =~ indic_tech1 + indic_tech2 + indic_tech3 + indic_tech4
    pack =~ indic_pack2 + indic_pack3 + indic_pack4
    envr =~ indic_envr1 + indic_envr2 + indic_envr3
"
cfa = cfa(model, df.indic)
summary(cfa, fit.measures=TRUE)
## lavaan 0.6-10 ended normally after 32 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        23
##                                                       
##   Number of observations                           300
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                64.495
##   Degrees of freedom                                32
##   P-value (Chi-square)                           0.001
## 
## Model Test Baseline Model:
## 
##   Test statistic                               653.118
##   Degrees of freedom                                45
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.947
##   Tucker-Lewis Index (TLI)                       0.925
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3701.997
##   Loglikelihood unrestricted model (H1)      -3669.749
##                                                       
##   Akaike (AIC)                                7449.994
##   Bayesian (BIC)                              7535.181
##   Sample-size adjusted Bayesian (BIC)         7462.238
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.058
##   90 Percent confidence interval - lower         0.037
##   90 Percent confidence interval - upper         0.079
##   P-value RMSEA <= 0.05                          0.239
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.066
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   tech =~                                             
##     indic_tech1       1.000                           
##     indic_tech2       1.431    0.137   10.447    0.000
##     indic_tech3       1.292    0.128   10.070    0.000
##     indic_tech4       0.464    0.083    5.596    0.000
##   pack =~                                             
##     indic_pack2       1.000                           
##     indic_pack3       1.078    0.171    6.315    0.000
##     indic_pack4       0.626    0.100    6.268    0.000
##   envr =~                                             
##     indic_envr1       1.000                           
##     indic_envr2       0.863    0.170    5.062    0.000
##     indic_envr3       0.388    0.095    4.072    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   tech ~~                                             
##     pack              0.080    0.028    2.866    0.004
##     envr              0.114    0.037    3.072    0.002
##   pack ~~                                             
##     envr              0.079    0.044    1.790    0.073
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .indic_tech1       0.371    0.038    9.782    0.000
##    .indic_tech2       0.192    0.047    4.079    0.000
##    .indic_tech3       0.556    0.059    9.427    0.000
##    .indic_tech4       0.441    0.037   11.882    0.000
##    .indic_pack2       0.462    0.067    6.897    0.000
##    .indic_pack3       0.508    0.076    6.651    0.000
##    .indic_pack4       0.358    0.037    9.731    0.000
##    .indic_envr1       0.412    0.141    2.921    0.003
##    .indic_envr2       0.615    0.114    5.400    0.000
##    .indic_envr3       1.008    0.086   11.731    0.000
##     tech              0.306    0.052    5.935    0.000
##     pack              0.375    0.079    4.722    0.000
##     envr              0.728    0.162    4.489    0.000

As a final bonus step, we can plot our CFA model using the package lavaanPlot.

lavaanPlot::lavaanPlot(model = cfa, coefs = TRUE)
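If you need the fit indices programmatically rather than reading them off `summary()`, lavaan provides `fitMeasures()`. A self-contained sketch on lavaan's bundled `HolzingerSwineford1939` example data (with the ship data, you would pass the `cfa` object fitted above instead):

```r
# Sketch: pulling selected fit indices with lavaan's fitMeasures(),
# shown on the HolzingerSwineford1939 data that ships with lavaan.
library(lavaan)

hs_model = "
    visual  =~ x1 + x2 + x3
    textual =~ x4 + x5 + x6
    speed   =~ x7 + x8 + x9
"
fit = cfa(hs_model, data = HolzingerSwineford1939)

# Named numeric vector of just the indices we care about
fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr"))
```

This is handy when comparing several candidate measurement models side by side.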

Congratulations! You should now be able to do exploratory analysis and dimensionality reduction on your own. 🎉